1 Chapter 1: Introduction

Team A consists of the following members: Amal Alqahtani, Jiaxiang Peng, Naureen Elahi, and Xinya Mu. Our work is available on GitHub.

Coronavirus disease 2019 (COVID-19) has spread rapidly around the world, causing unprecedented damage that the world was not prepared for. To date, the CDC reports a total of 4,542,579 cases and 152,870 deaths in the United States (Cases in U.S, 2020). Many risk factors have been hypothesized to affect the case and death rates from the virus. We felt the following questions were relevant: Which regions have the highest number of deaths? What can we say about patient demographics? Is race a significant risk factor for increased COVID-19 incidence in the United States? Are there any general trends among underlying health conditions? These questions are all suited to Exploratory Data Analysis (EDA), and with them in mind, we looked for readily available COVID-19 data to analyze. Eventually, our question morphed into the following: What are the factors (i.e. patient demographics, social determinants of health, environmental variables, underlying health conditions, country of origin) affecting the number of COVID-19 cases and the death rate across different geographical locations in the US?

We found a dataset called Covid-19-Dataset on GitHub: https://github.com/johndurbin93/Covid-19-Dataset. The dataset includes confirmed COVID-19 case and death counts through April 14, 2020, which were obtained for each U.S. county from the Center for Systems Science and Engineering (CSSE) Coronavirus Resource Center at Johns Hopkins University. Race demographics for counties were obtained from the County Health Rankings and Roadmaps Program database. Daily temperature data for counties were obtained from the National Oceanic and Atmospheric Administration. The data were compiled by a group of researchers.

The report is organized as follows:

  1. Description of the Data (explanation of the dataset and its variables)
  2. Demographics Data of the Patients
  3. Independent Variables EDA: Slicing the Data for an Overview
  4. Independent Variables EDA: Boxplots, Scatterplots, ANOVA, & Chi-Square
  5. Linear Regression Model
  6. Conclusion

2 Chapter 2: Description of the Data

2.1 Source Data

The data looks like the following:

tibble [3,144 × 82] (S3: tbl_df/tbl/data.frame)
 $ Province                                                                                               : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
 $ State                                                                                                  : chr [1:3144] "New York" "New York" "New York" "New York" ...
 $ Latitude                                                                                               : num [1:3144] 40.8 40.7 40.9 41.2 41.8 ...
 $ Longitude                                                                                              : num [1:3144] -74 -73.6 -72.8 -73.8 -87.8 ...
 $ Tests                                                                                                  : num [1:3144] 499143 499143 499143 499143 110616 ...
 $ Days Since 1st Case                                                                                    : num [1:3144] 44 41 38 43 82 35 41 80 39 38 ...
 $ total_cases                                                                                            : num [1:3144] 110465 25250 22691 20191 16323 ...
 $ deaths                                                                                                 : num [1:3144] 7905 1001 608 596 577 ...
 $ Population (for demographic %'s)                                                                       : chr [1:3144] "8623000" "1358343" "1481093" "967612" ...
 $ % less than 18 years of age                                                                            : chr [1:3144] "20.9" "21.459675499999999" "21.134324500000002" "21.900513799999999" ...
 $ % 65 and over                                                                                          : chr [1:3144] "14.1" "17.763039200000001" "16.862951899999999" "17.053116299999999" ...
 $ % Black                                                                                                : chr [1:3144] "24.3" "11.6331442" "7.3924459799999998" "13.8042935" ...
 $ % American Indian & Alaska Native                                                                      : chr [1:3144] "0.4" "0.54294092000000005" "0.61373593999999998" "0.95647842000000005" ...
 $ % Asian                                                                                                : chr [1:3144] "13.9" "10.4504532" "4.1896086199999996" "6.43553408" ...
 $ % Native Hawaiian/Other Pacific Islander                                                               : chr [1:3144] "0.1" "0.1" "9.5899999999999999E-2" "0.13228443000000001" ...
 $ % Hispanic                                                                                             : chr [1:3144] "29.1" "17.231362000000001" "19.775260599999999" "25.140345499999999" ...
 $ % Non-Hispanic White                                                                                   : chr [1:3144] "32.1" "59.333835399999998" "67.190378999999993" "53.088118000000001" ...
 $ % Not Proficient in English                                                                            : chr [1:3144] "9" "5.3660427200000003" "4.00639637" "6.3180527499999997" ...
 $ % Female                                                                                               : chr [1:3144] "52.3" "51.306334300000003" "50.771693599999999" "51.559095999999997" ...
 $ % Rural                                                                                                : chr [1:3144] "0" "0.19223132000000001" "2.6011316799999999" "3.2734774500000001" ...
 $ Population Density per Square mile of Land (2010)                                                      : num [1:3144] 69468 4705 1637 2205 5495 ...
 $ Housing Density Per Square Mile of Land                                                                : num [1:3144] 37106 1645 625 861 2306 ...
 $ Avg Daily March 2011 Sunlight (KJ/m²) Missing HI and AK                                                : num [1:3144] 16233 16649 16539 15031 14299 ...
 $ GDP 2018                                                                                               : num [1:3144] 600244287 81196003 81211899 73404644 362063569 ...
 $ GDP/capita                                                                                             : num [1:3144] 69.6 59.8 54.8 75.9 69.9 ...
 $ Percentage Living in Poverty, All Ages, 2016                                                           : num [1:3144] 17.2 6.1 7.6 10 15 22.9 6.9 16.3 14.4 15.6 ...
 $ Air Quality, Annual Average Ambient Concentrations of PM2.5, 2014                                      : chr [1:3144] "10.8" "10" "9" "10.4" ...
 $ Primary Care Physicians Ratio                                                                          : chr [1:3144] "31.417361111111109" "29.834027777777777" "56.709027777777777" "30.334027777777777" ...
 $ Dentist Ratio                                                                                          : chr [1:3144] "23.334027777777777" "34.500694444444441" "50.209027777777777" "37.834027777777777" ...
 $ Mental Health Provider Ratio                                                                           : chr [1:3144] "4.834027777777778" "13.792361111111111" "15.625694444444443" "10.750694444444443" ...
 $ High School Graduation Rate                                                                            : chr [1:3144] "74.536495200000005" "90.769602500000005" "89.560601599999998" "89.554779199999999" ...
 $ % Some College                                                                                         : chr [1:3144] "84.074597800000006" "75.579882699999999" "67.067068199999994" "71.893479799999994" ...
 $ % Unemployed                                                                                           : chr [1:3144] "3.6720665100000001" "3.5355112100000001" "3.8509406199999998" "3.8880261699999998" ...
 $ % Children in Poverty                                                                                  : chr [1:3144] "19.7" "7.6" "9.4" "10.3" ...
 $ Income Inequality Ratio (80th%/20th%)                                                                  : chr [1:3144] "9.2065919600000008" "4.5137498100000002" "4.3752126000000002" "6.18534249" ...
 $ % Single-Parent Households                                                                             : chr [1:3144] "39.575203500000001" "19.238140600000001" "23.6102569" "25.424638399999999" ...
 $ Social Association Rate                                                                                : chr [1:3144] "12.8789886" "7.9882352399999998" "6.7450214400000004" "8.3754656999999995" ...
 $ Violent Crime Rate                                                                                     : chr [1:3144] "586.40744800000004" "143.663387" "124.039181" "220.606166" ...
 $ Air pollution: Average Daily PM2.5                                                                     : chr [1:3144] "10.8" "10" "9" "10.4" ...
 $ Presence of Drinking Water Violation                                                                   : chr [1:3144] "No" "No" "No" "Yes" ...
 $ % Severe Housing Problems                                                                              : chr [1:3144] "24.378637699999999" "21.324080599999998" "22.888761800000001" "24.236306200000001" ...
 $ Housing: Severe Cost Burden                                                                            : chr [1:3144] "19.610767299999999" "19.1674103" "20.427237699999999" "20.895964899999999" ...
 $ Housing: Overcrowding                                                                                  : chr [1:3144] "5.4547143900000004" "2.5236808000000002" "2.6421104199999998" "4.2602996299999996" ...
 $ Housing: Inadequate Facilities                                                                         : chr [1:3144] "1.2204915199999999" "0.72802853000000001" "0.78609931" "0.73443351999999995" ...
 $ % Drive Alone to Work                                                                                  : chr [1:3144] "6.0475223400000004" "68.609857500000004" "79.604339800000005" "57.587820100000002" ...
 $ % Long Commute - Drives Alone                                                                          : chr [1:3144] "66.7" "45.7" "41.9" "41.2" ...
 $ Sleep <7 Hours_Percent                                                                                 : chr [1:3144] "NA" "38.049835600000002" "35.608102700000003" "33.101763800000001" ...
 $ Sleep <7 Hours_CI_Low                                                                                  : chr [1:3144] "NA" "37.488497199999998" "34.960704200000002" "32.608553999999998" ...
 $ Sleep <7 Hours_CI_High                                                                                 : chr [1:3144] "NA" "38.576512200000003" "36.198949800000001" "33.594731299999999" ...
 $ Diabetes Total Percentage                                                                              : num [1:3144] 6.5 7.2 6.8 6.4 9 10.3 6.8 8.1 6.9 8.2 ...
 $ Diabetes Male Percentage                                                                               : num [1:3144] 6.7 8.5 7.7 6.7 9.7 10.7 7.1 8.6 7 8.7 ...
 $ Diabetes Female Percentage                                                                             : num [1:3144] 6.3 6.2 6 6.2 8.4 10.1 6.6 7.7 6.8 7.9 ...
 $ Coronary Heart Disease Death Rate per 100,000, All Ages, All Races/Ethnicities, Both Genders, 2014-2016: num [1:3144] 100.4 142.4 120.1 97.6 95.2 ...
 $ Hypertension Death Rate per 100,000 (any mention), 35+, All Races/Ethnicities, Both Genders, 2014-2016 : num [1:3144] 232 153 181 124 191 ...
 $ Obesity, Age-Adjusted Percentage, 20+. 2015                                                            : num [1:3144] 15.9 22.5 23.6 20.2 27.2 34.1 22.4 21.2 23.4 23.9 ...
 $ % Fair or Poor Health                                                                                  : chr [1:3144] "15.610279800000001" "12.0544118" "13.0711332" "14.8011888" ...
 $ Average Number of Physically Unhealthy Days                                                            : chr [1:3144] "3.5938226700000002" "2.8691053700000002" "3.1473144999999998" "3.1513169799999998" ...
 $ Average Number of Mentally Unhealthy Days                                                              : chr [1:3144] "3.97126146" "3.4601849699999998" "3.9316660200000002" "3.9107989299999999" ...
 $ % Low Birthweight                                                                                      : chr [1:3144] "8.2870096600000007" "7.8873580399999996" "7.7408509700000003" "7.95359718" ...
 $ % Smokers (adults)                                                                                     : chr [1:3144] "12.418234200000001" "11.225364600000001" "12.625481499999999" "11.371546" ...
 $ % Adults with Obesity                                                                                  : chr [1:3144] "14.6" "23.6" "24.6" "20.7" ...
 $ Food Environment Index                                                                                 : chr [1:3144] "8.3000000000000007" "9.6999999999999993" "9.3000000000000007" "9.1" ...
 $ % Physically Inactive                                                                                  : chr [1:3144] "17.5" "22.8" "24.2" "21.2" ...
 $ % With Access to Exercise Opportunities                                                                : chr [1:3144] "100" "98.858183299999993" "93.3366592" "99.621119899999997" ...
 $ % Excessive Drinking                                                                                   : chr [1:3144] "24.812851999999999" "18.439903699999999" "18.671426799999999" "18.011370899999999" ...
 $ % Uninsured                                                                                            : chr [1:3144] "6.15572813" "5.32768102" "5.4469207300000004" "6.9293390300000004" ...
 $ Preventable Hospitalization Rate (Preventable hospital stays)                                          : chr [1:3144] "3082" "3588" "4339" "3870" ...
 $ % With Annual Mammogram                                                                                : chr [1:3144] "39" "45" "42" "46" ...
 $ % Flu Vaccinated                                                                                       : chr [1:3144] "46" "52" "51" "51" ...
 $ Chronic Respiratory Disease: mortality rate per 100K (2014)                                            : chr [1:3144] "23.47" "29.03" "38.590000000000003" "31.82" ...
 $ Liver Disease: crude mortality rate per 100K (1999-2018)                                               : chr [1:3144] "7.3202151400000002" "7.8321364999999998" "9.3999156199999998" "8.6888457900000002" ...
 $ Liver Disease: % of Total Deaths (1999-2018)                                                           : chr [1:3144] "2.72836E-3" "2.4593800000000002E-3" "3.2464299999999998E-3" "1.92728E-3" ...
 $ Liver Disease: crude mortality rate per 100K (2018)                                                    : chr [1:3144] "6.6924499900000001" "8.7606738499999999" "11.478009800000001" "8.9912072199999997" ...
 $ Liver Disease: % of Total Deaths (2018)                                                                : chr [1:3144] "1.9492800000000001E-3" "2.1281199999999998E-3" "3.04017E-3" "1.5558499999999999E-3" ...
 $ Avg Temp Peak Growth-10 Rate                                                                           : num [1:3144] 8.41 7.41 6.86 5.88 2.25 ...
 $ Avg Temp 10 Before First-Current                                                                       : num [1:3144] 8.33 7.78 7.1 6.68 2.55 ...
 $ Avg Temp First-Current                                                                                 : num [1:3144] 9.23 8.36 7.82 7.53 3.34 ...
 $ First Case                                                                                             : POSIXct[1:3144], format: "2020-03-02" "2020-03-05" ...
 $ Stay At Home                                                                                           : POSIXct[1:3144], format: "2020-03-22" "2020-03-22" ...
 $ No Cases                                                                                               : num [1:3144] 0 0 0 0 0 0 0 0 0 0 ...
 $ No Stay At Home Order                                                                                  : num [1:3144] 0 0 0 0 0 0 0 0 0 0 ...
 $ Stay At Home Order After First Case                                                                    : num [1:3144] 1 1 1 1 1 1 1 1 1 1 ...

The Covid19 dataset has 82 columns and 3,144 rows (entries), for a total of 257,808 individual data points. Of the 82 columns, we select the following variables for EDA:

  1. Province
  2. State
  3. State Code
  4. Tests
  5. Total cases
  6. Deaths
  7. Population (for demographic %’s)
  8. % less than 18 years of age
  9. % 65 and over
  10. % Black
  11. % American Indian & Alaska Native
  12. % Asian
  13. % Native Hawaiian/Other Pacific Islander
  14. % Hispanic
  15. % Non-Hispanic White
  16. % Not Proficient in English
  17. % Female
  18. No Cases
  19. No Stay At Home Order
  20. Stay At Home Order After First Case
  21. Percentage Living in Poverty
  22. Social Association Rate

To prepare our data for EDA we clean the dataset and remove all NAs.

tibble [3,144 × 16] (S3: tbl_df/tbl/data.frame)
 $ Province                                : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
 $ State                                   : chr [1:3144] "New York" "New York" "New York" "New York" ...
 $ total_cases                             : num [1:3144] 110465 25250 22691 20191 16323 ...
 $ deaths                                  : num [1:3144] 7905 1001 608 596 577 ...
 $ Population (for demographic %'s)        : chr [1:3144] "8623000" "1358343" "1481093" "967612" ...
 $ % less than 18 years of age             : chr [1:3144] "20.9" "21.459675499999999" "21.134324500000002" "21.900513799999999" ...
 $ % 65 and over                           : chr [1:3144] "14.1" "17.763039200000001" "16.862951899999999" "17.053116299999999" ...
 $ % Black                                 : chr [1:3144] "24.3" "11.6331442" "7.3924459799999998" "13.8042935" ...
 $ % American Indian & Alaska Native       : chr [1:3144] "0.4" "0.54294092000000005" "0.61373593999999998" "0.95647842000000005" ...
 $ % Asian                                 : chr [1:3144] "13.9" "10.4504532" "4.1896086199999996" "6.43553408" ...
 $ % Native Hawaiian/Other Pacific Islander: chr [1:3144] "0.1" "0.1" "9.5899999999999999E-2" "0.13228443000000001" ...
 $ % Hispanic                              : chr [1:3144] "29.1" "17.231362000000001" "19.775260599999999" "25.140345499999999" ...
 $ % Non-Hispanic White                    : chr [1:3144] "32.1" "59.333835399999998" "67.190378999999993" "53.088118000000001" ...
 $ % Not Proficient in English             : chr [1:3144] "9" "5.3660427200000003" "4.00639637" "6.3180527499999997" ...
 $ % Female                                : chr [1:3144] "52.3" "51.306334300000003" "50.771693599999999" "51.559095999999997" ...
 $ poor_health                             : chr [1:3144] "15.610279800000001" "12.0544118" "13.0711332" "14.8011888" ...
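
The structure listings show many numeric columns stored as text, including values in scientific notation and the literal string "NA". As a minimal sketch of the kind of cleaning step described above, the following Python snippet parses such cells and drops incomplete rows. The cell values are copied from the listings; the function name `to_number` and the choice of columns are assumptions made for illustration, not the report's actual code.

```python
from typing import Optional


def to_number(text: str) -> Optional[float]:
    """Convert a text cell to float; return None for missing markers."""
    cleaned = text.strip()
    if cleaned in {"", "NA"}:
        return None
    # float() also parses scientific notation such as "9.58E-2".
    return float(cleaned)


# Cell values copied from the structure listings above:
# (province, "% Native Hawaiian/Other Pacific Islander", "Sleep <7 Hours_Percent")
raw_rows = [
    ("New York City", "0.1", "NA"),
    ("Nassau", "0.1", "38.049835600000002"),
    ("Suffolk", "9.5899999999999999E-2", "35.608102700000003"),
]

# Parse every numeric cell, then keep only rows with no missing value.
parsed = [(name, to_number(a), to_number(b)) for name, a, b in raw_rows]
complete = [row for row in parsed if None not in row[1:]]

for row in complete:
    print(row)
```

Note that dropping rows depends on which columns are inspected: in this sketch New York City is removed only because the sleep column is included, so NA removal is best applied after the variables of interest have been selected.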

3 Chapter 3: Independent Variables EDA

3.1 United States COVID-19 Cases and Deaths by Provinces (Cities)

3.1.1 What are the top 15 Provinces based on the number of cases?

The following bar chart shows the top 15 cities by number of Covid-19 cases.

The bar chart above shows the top 15 provinces by number of cases. New York City has the most COVID-19 cases, with a total of over 100,000, while every other province has fewer than 30,000.

3.1.2 What are the top 15 Provinces based on the number of deaths?

The following bar chart shows the top 15 cities by number of deaths.

The bar chart above shows the top 15 provinces by number of deaths. New York City has the most deaths, around 8,000, while every other province has roughly 1,000 or fewer.
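
The two rankings above amount to sorting the provinces by one column and keeping the largest entries. A minimal Python sketch follows, using only the four rows whose names and counts are visible in the structure listing in Chapter 2; the helper `top_n` is a hypothetical name, and the report's own charts were presumably produced in R.

```python
import heapq

# (province, total_cases, deaths) for the first four rows shown in the
# structure listing of the dataset.
counties = [
    ("New York City", 110465, 7905),
    ("Nassau", 25250, 1001),
    ("Suffolk", 22691, 608),
    ("Westchester", 20191, 596),
]


def top_n(rows, column, n):
    """Return the n rows with the largest value in the given column index."""
    return heapq.nlargest(n, rows, key=lambda row: row[column])


top_by_cases = top_n(counties, 1, 2)
top_by_deaths = top_n(counties, 2, 2)

print([name for name, _, _ in top_by_cases])
print([name for name, _, _ in top_by_deaths])
```

With the full dataset, the same call with `n=15` would give the rows plotted in each bar chart.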

3.1.3 What are the top 15 States based on the number of Tests?

The bar chart above shows the top 15 states by number of tests. New York State has performed 499,143 tests, the highest of any state, while every other state has performed fewer than 200k.

3.1.4 What is the average number of cases for each State?

                  State total_cases
1               Alabama       59.03
2                Alaska        9.83
3               Arizona      258.60
4              Arkansas       19.44
5            California      437.43
6              Colorado      122.41
7           Connecticut     1682.00
8              Delaware      638.33
9  District of Columbia     2058.00
10              Florida      323.07
11              Georgia       85.74
12               Hawaii      101.60
13                Idaho       33.32
14             Illinois      227.48
15              Indiana       94.12
16                 Iowa       19.20
17               Kansas       13.84
18             Kentucky       17.32
19            Louisiana      335.34
20                Maine       45.88
21             Maryland      394.75
22        Massachusetts     1843.87
23             Michigan      316.96
24            Minnesota       19.10
25          Mississippi       37.68
26             Missouri       40.98
27              Montana        7.18
28             Nebraska        9.48
29               Nevada      184.35
30        New Hampshire      103.50
31           New Jersey     3196.29
32           New Mexico       40.82
33             New York     3274.52
34       North Carolina       51.20
35         North Dakota        6.45
36                 Ohio       82.81
37             Oklahoma       28.51
 [ reached 'max' / getOption("max.print") -- omitted 14 rows ]

3.1.5 What is the average number of deaths for each State?

                  State  deaths
1               Alabama   1.701
2                Alaska   0.172
3               Arizona   7.133
4              Arkansas   0.427
5            California  13.328
6              Colorado   5.109
7           Connecticut  83.375
8              Delaware  14.333
9  District of Columbia  67.000
10              Florida   7.836
11              Georgia   3.270
12               Hawaii   1.800
13                Idaho   0.750
14             Illinois   8.510
15              Indiana   4.207
16                 Iowa   0.444
17               Kansas   0.657
18             Kentucky   0.900
19            Louisiana  15.922
20                Maine   1.250
21             Maryland  12.667
22        Massachusetts  49.600
23             Michigan  21.133
24            Minnesota   0.920
25          Mississippi   1.366
26             Missouri   1.284
27              Montana   0.143
28             Nebraska   0.161
29               Nevada   7.059
30        New Hampshire   0.300
31           New Jersey 133.476
32           New Mexico   0.939
33             New York 174.871
34       North Carolina   1.130
35         North Dakota   0.151
36                 Ohio   3.705
37             Oklahoma   1.403
 [ reached 'max' / getOption("max.print") -- omitted 14 rows ]
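
The two tables above appear to be means taken over the rows of each state. A minimal sketch of that group-and-average step is shown below. The rows are hypothetical values invented for illustration (they are not taken from the dataset), and `state_means` is an assumed helper name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (state, total_cases, deaths). Not taken from the dataset.
rows = [
    ("State A", 100, 4),
    ("State A", 20, 0),
    ("State B", 7, 1),
    ("State B", 9, 0),
    ("State B", 2, 2),
]


def state_means(rows):
    """Group rows by state and return {state: (mean cases, mean deaths)}."""
    grouped = defaultdict(list)
    for state, cases, deaths in rows:
        grouped[state].append((cases, deaths))
    return {
        state: (mean(c for c, _ in values), mean(d for _, d in values))
        for state, values in sorted(grouped.items())
    }


averages = state_means(rows)
for state, (avg_cases, avg_deaths) in averages.items():
    print(f"{state}: cases={avg_cases:.2f} deaths={avg_deaths:.3f}")
```

Because this is an unweighted mean over rows, a state with one very large county and many small ones can show a low average even when its total is high.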

3.1.6 Which cities had the greatest % of population of people with poor health?

3.2 Patient Demographics

3.2.1 What are the patient demographics?

Table: Statistics summary.

           TC Population young  old black AIAN Asian   NH Hispanic  NHW Female Poverty Social
Min         0         88   0.0  4.8   0.0  0.0   0.0  0.0      0.6  2.7   26.8     3.4    0.0
Q1          2      11034  20.1 16.3   0.7  0.4   0.5  0.0      2.4 64.7   49.4    11.4    8.2
Median      9      25758  22.1 19.0   2.2  0.6   0.7  0.1      4.4 83.5   50.3    14.8   11.1
Mean      191     105871  22.1 19.3   8.8  2.4   1.5  0.1      9.6 76.2   49.9    15.9   11.6
Q3         39      67013  23.8 21.8   9.6  1.3   1.4  0.1      9.9 92.3   51.0    19.0   14.4
Max    110465   10105518  42.0 57.6  85.4 92.5  43.4 48.9     96.4 97.9   56.9    48.6   52.3

From the means in the summary table, the average proportion of people under the age of 18 is 22.1%, and the average proportion of people aged 65 and over is 19.3%. The largest racial group is Non-Hispanic White, with an average proportion of 76.2%. The average proportion of women is 49.9%, the average proportion of people living in poverty is 15.9%, and the average Social Association Rate is 11.6. We divide the data into four levels according to total cases.

3.2.2 Which race is the majority of the sample?

Using the average values, we draw a pie chart of race proportions, which shows the overall share of each race. In the following sections, we study which races' proportions are related to the number of confirmed cases and the number of deaths.

3.3 Stay at home policy in each province

3.4 Underlying Health Conditions

3.4.1 Are there any common underlying health conditions?

3.4.2 Does any disease relate to the death rate?

The correlation analysis shows that liver_total_death is positively correlated with deaths (r = 0.4338), which is a moderate correlation.
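
For readers who want to see how a correlation coefficient of this kind is computed, here is a minimal sketch of the Pearson formula in Python. The input values are hypothetical and chosen only to exercise the function, so the result will not reproduce the 0.4338 reported above; `pearson_r` is an assumed helper name.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, not taken from the dataset.
liver_share = [0.001, 0.002, 0.003, 0.004, 0.005]
deaths = [3, 1, 4, 2, 6]

r = pearson_r(liver_share, deaths)
print(f"r = {r:.4f}")
```

The coefficient ranges from -1 to 1 and does not depend on the units of either variable, which is why a share measured in thousandths can be compared directly with a raw death count.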


Call:
lm(formula = deaths ~ sleep_hour + sleep_hour_high + Low_birthweight + 
    heart_disease + smokers + adult_obesity + Food_environment + 
    Respiratory + liver_Total_death, data = disease3)

Residuals:
   Min     1Q Median     3Q    Max 
-135.4  -18.2   -3.9   10.4  859.3 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        -162.8499    33.5760   -4.85            0.0000014 ***
sleep_hour           36.8056    10.1538    3.62               0.0003 ***
sleep_hour_high     -32.7256    10.1426   -3.23               0.0013 ** 
Low_birthweight       4.1580     1.3226    3.14               0.0017 ** 
heart_disease         0.3083     0.0705    4.38            0.0000134 ***
smokers              -1.2021     0.9605   -1.25               0.2110    
adult_obesity        -1.3285     0.4261   -3.12               0.0019 ** 
Food_environment     12.4635     2.5100    4.97            0.0000008 ***
Respiratory          -0.5801     0.1467   -3.96            0.0000817 ***
liver_Total_death 12460.4546  1307.0073    9.53 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56 on 1027 degrees of freedom
Multiple R-squared:  0.28,  Adjusted R-squared:  0.274 
F-statistic: 44.4 on 9 and 1027 DF,  p-value: <0.0000000000000002
       sleep_hour   sleep_hour_high   Low_birthweight     heart_disease 
           436.20            443.72              2.12              1.42 
          smokers     adult_obesity  Food_environment       Respiratory 
             3.27              1.89              2.14              1.65 
liver_Total_death 
             1.39 
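
The unlabeled values printed after the regression summary appear to be variance inflation factors (VIF), which Chapter 5 also refers to. A VIF equals 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others. The sketch below inverts that formula to show how much of each predictor is explained by the rest; the function names are assumptions, and the three VIF values are copied from the output above.

```python
def vif_from_r2(r_squared: float) -> float:
    """VIF for a predictor whose regression on the other predictors has this R^2."""
    return 1.0 / (1.0 - r_squared)


def r2_from_vif(vif: float) -> float:
    """Invert the VIF formula to recover the implied R^2."""
    return 1.0 - 1.0 / vif


# VIF values copied from the output above.
reported_vif = {
    "sleep_hour": 436.20,
    "sleep_hour_high": 443.72,
    "liver_Total_death": 1.39,
}

implied_r2 = {name: r2_from_vif(v) for name, v in reported_vif.items()}

for name, r2 in implied_r2.items():
    print(f"{name}: implied R^2 = {r2:.4f}")
```

An implied R^2 above 0.99 for sleep_hour and sleep_hour_high means the two are nearly redundant, which is consistent with the second model below keeping only sleep_hour. That model also omits smokers, the one predictor with a p-value above 0.05 in the first fit.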

Call:
lm(formula = deaths ~ sleep_hour + +Low_birthweight + heart_disease + 
    adult_obesity + Food_environment + Respiratory + liver_Total_death, 
    data = disease5)

Residuals:
   Min     1Q Median     3Q    Max 
-161.0  -18.2   -4.8    8.7  872.0 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        -181.1828    33.3093   -5.44         0.0000000668 ***
sleep_hour            3.6256     0.6217    5.83         0.0000000073 ***
Low_birthweight       4.4785     1.3075    3.43              0.00064 ***
heart_disease         0.2550     0.0686    3.72              0.00021 ***
adult_obesity        -1.6309     0.4196   -3.89              0.00011 ***
Food_environment     12.8099     2.4045    5.33         0.0000001223 ***
Respiratory          -0.7153     0.1398   -5.12         0.0000003714 ***
liver_Total_death 14469.7867  1206.9498   11.99 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.4 on 1029 degrees of freedom
Multiple R-squared:  0.269, Adjusted R-squared:  0.264 
F-statistic: 54.1 on 7 and 1029 DF,  p-value: <0.0000000000000002
       sleep_hour   Low_birthweight     heart_disease     adult_obesity 
             1.61              2.04              1.33              1.81 
 Food_environment       Respiratory liver_Total_death 
             1.94              1.48              1.17 

3.5 Impact of Temperature

3.5.1 Does the temperature relate to the Total Cases or Death Rate?

tibble [3,144 × 8] (S3: tbl_df/tbl/data.frame)
 $ Province    : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
 $ State       : chr [1:3144] "New York" "New York" "New York" "New York" ...
 $ days        : num [1:3144] 44 41 38 43 82 35 41 80 39 38 ...
 $ total_cases : num [1:3144] 110465 25250 22691 20191 16323 ...
 $ deaths      : num [1:3144] 7905 1001 608 596 577 ...
 $ temp_peak   : num [1:3144] 8.41 7.41 6.86 5.88 2.25 ...
 $ temp_before : num [1:3144] 8.33 7.78 7.1 6.68 2.55 ...
 $ temp_current: num [1:3144] 9.23 8.36 7.82 7.53 3.34 ...

According to the correlation diagram, temperature is only weakly related to total_cases and deaths.

4 Chapter 4: Independent Variables EDA: Boxplots, Scatterplots, ANOVA, & Chi-Square

First, we divide total cases into four levels using the quartiles (Q1, median, and Q3) for further analysis.

[1]      0      2      9     39 110465
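
The five numbers above are the minimum, Q1, median, Q3 and maximum of total cases. A minimal sketch of turning those cut points into four levels is given below. Whether each boundary value belongs to the lower or the upper level is an assumption here (the sketch puts it in the lower one), and `case_level` is a hypothetical name.

```python
from bisect import bisect_left

# Q1, median and Q3 of total cases, taken from the quantile output above.
CUT_POINTS = [2, 9, 39]


def case_level(total_cases: int) -> int:
    """Return a level from 1 to 4; a value equal to a cut point falls in the lower level."""
    return bisect_left(CUT_POINTS, total_cases) + 1


# The five quantile values themselves, used as a quick check.
levels = [case_level(v) for v in (0, 2, 9, 39, 110465)]
print(levels)
```

Because the cut points are quartiles, each level holds roughly a quarter of the rows, although ties at small counts such as 0 or 2 can make the groups uneven.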

    Shapiro-Wilk normality test

data:  df2$TC
W = 0.05, p-value <0.0000000000000002

    Bartlett test of homogeneity of variances

data:  TC by rank
Bartlett's K-squared = 25341, df = 3, p-value <0.0000000000000002

The Shapiro-Wilk test is used to test whether the data conforms to the normal distribution.

H0: The sample data is not significantly different from the normal distribution

H1: The sample data is significantly different from the normal distribution

The p-value is less than 0.05, so the null hypothesis is rejected: total cases are not normally distributed.

Test for homogeneity of variance (Bartlett test)

H0: The variance of total cases is the same at every level

H1: The variance of total cases differs for at least one level

The p-value is less than 0.05, so the null hypothesis is rejected: total cases do not meet the homogeneity of variance assumption.
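
To make the Bartlett statistic less of a black box, the sketch below computes K-squared from the textbook formula: the pooled log variance is compared with the individual group log variances and then divided by a small-sample correction. The groups are hypothetical toy data, `bartlett_k_squared` is an assumed name, and the p-value is left out because it needs a chi-square distribution with k - 1 degrees of freedom, which the standard library does not provide.

```python
from math import log
from statistics import variance


def bartlett_k_squared(groups):
    """Return (K-squared, degrees of freedom) for a list of samples."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    variances = [variance(g) for g in groups]  # sample variances
    total = sum(sizes)
    # Pooled variance weights each group by its degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)
    numerator = (total - k) * log(pooled) - sum(
        (n - 1) * log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (total - k)) / (3 * (k - 1))
    return numerator / correction, k - 1


# Hypothetical groups, not taken from the dataset.
equal_spread = [[1, 2, 3], [4, 5, 6]]
unequal_spread = [[1, 2, 3], [2, 4, 6]]

print(bartlett_k_squared(equal_spread))
print(bartlett_k_squared(unequal_spread))
```

The statistic is zero when all group variances match and grows as they diverge. With four levels the test has 3 degrees of freedom, matching the output above.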

5 Chapter 5: Linear Regression Model

5.1 SMART Question: What factors influence the death rate the most?

5.2 Which disease variables best explain death rate?

We use the exhaustive method for feature selection.
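
Exhaustive selection means every non-empty subset of the candidate predictors is fitted and scored. The sketch below only enumerates those subsets and builds the matching model formulas. It assumes the candidate pool is the nine variables listed in the abbreviation table later in this chapter, and it does not fit any model.

```python
from itertools import combinations

# Candidate predictors, assumed to be the nine names in the abbreviation table.
CANDIDATES = [
    "sleep_hour", "diabetes", "heart_disease", "obesity_age", "smokers",
    "adult_obesity", "excessive_drink", "liver_crude_mortality",
    "liver_Total_death",
]


def all_subsets(variables):
    """Yield every non-empty subset, smallest subsets first."""
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset


subsets = list(all_subsets(CANDIDATES))
formulas = ["deaths ~ " + " + ".join(s) for s in subsets]

print(len(formulas))
print(formulas[0])
```

With nine candidates there are 2^9 - 1 = 511 models to compare, which is small enough to search completely; criteria such as BIC, adjusted R^2 and Mallows' Cp then rank the fitted models.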


Call:
lm(formula = deaths ~ sleep_hour + obesity_age + liver_crude_mortality + 
    liver_Total_death, data = disease2)

Residuals:
   Min     1Q Median     3Q    Max 
-180.8  -15.9   -5.1    5.4  908.7 

Coefficients:
                       Estimate Std. Error t value             Pr(>|t|)    
(Intercept)             -27.370     17.644   -1.55                 0.12    
sleep_hour                3.907      0.585    6.67        0.00000000004 ***
obesity_age              -2.444      0.432   -5.66        0.00000001953 ***
liver_crude_mortality    -1.589      0.377   -4.22        0.00002661895 ***
liver_Total_death     15637.213   1207.899   12.95 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.3 on 1038 degrees of freedom
Multiple R-squared:  0.238, Adjusted R-squared:  0.235 
F-statistic:   81 on 4 and 1038 DF,  p-value: <0.0000000000000002
           sleep_hour           obesity_age liver_crude_mortality 
                 1.39                  1.51                  1.03 
    liver_Total_death 
                 1.14 

Call:
lm(formula = deaths ~ sleep_hour + heart_disease + smokers + 
    adult_obesity + liver_crude_mortality + liver_Total_death, 
    data = disease2)

Residuals:
   Min     1Q Median     3Q    Max 
-169.4  -17.2   -4.9    7.4  883.9 

Coefficients:
                        Estimate Std. Error t value             Pr(>|t|)    
(Intercept)             -42.2757    17.6272   -2.40              0.01665 *  
sleep_hour                4.6860     0.6373    7.35     0.00000000000039 ***
heart_disease             0.2544     0.0696    3.66              0.00027 ***
smokers                  -3.3650     0.8264   -4.07     0.00005022310760 ***
adult_obesity            -1.8016     0.4169   -4.32     0.00001697342422 ***
liver_crude_mortality    -1.4158     0.3928   -3.60              0.00033 ***
liver_Total_death     14723.3190  1225.3526   12.02 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.8 on 1036 degrees of freedom
Multiple R-squared:  0.253, Adjusted R-squared:  0.249 
F-statistic: 58.5 on 6 and 1036 DF,  p-value: <0.0000000000000002
           sleep_hour         heart_disease               smokers 
                 1.67                  1.36                  2.36 
        adult_obesity liver_crude_mortality     liver_Total_death 
                 1.77                  1.15                  1.19 

By BIC, the best subset is sleep_hour, obesity_age, liver_crude_mortality and liver_Total_death.

All predictor p-values are well below 0.05 and all VIF values are below 5, so multicollinearity is not a concern, although the adjusted R-squared is only 0.235.

By adjusted R^2, the best subset is sleep_hour, heart_disease, smokers, adult_obesity, liver_crude_mortality and liver_Total_death.

All predictor p-values are again well below 0.05 and all VIF values are below 5; the adjusted R-squared rises slightly to 0.249.
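The VIF check can be reproduced by hand: the VIF of a predictor is 1 / (1 - R^2) from regressing that predictor on the others. The sketch below is an illustration on `mtcars`; the helper name `vif_manual` and the predictor list are hypothetical and not taken from the report.

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

predictors <- c("wt", "hp", "qsec", "drat")
vifs <- vif_manual(mtcars, predictors)
all(vifs < 5)   # the same rule of thumb as used in the text
```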

                      Abbreviation
sleep_hour                      s_
diabetes                         d
heart_disease                    h
obesity_age                      o
smokers                         sm
adult_obesity                    a
excessive_drink                  e
liver_crude_mortality          l__
liver_Total_death              l_T

By Mallows' Cp, the best subset is sleep_hour, heart_disease, smokers, adult_obesity, liver_crude_mortality and liver_Total_death, which is exactly the model selected by adjusted R^2.
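For reference, Mallows' Cp for a candidate model with p coefficients is RSS_p / sigma^2 - n + 2p, where sigma^2 is estimated from the full model. A small sketch under that definition, again on `mtcars` with a hypothetical choice of columns:

```r
dat  <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
n    <- nrow(dat)
full <- lm(mpg ~ ., data = dat)

# Error variance estimated from the full model
sigma2_full <- sum(resid(full)^2) / full$df.residual

mallows_cp <- function(fit) {
  p <- length(coef(fit))          # number of coefficients, intercept included
  sum(resid(fit)^2) / sigma2_full - n + 2 * p
}

cp_full  <- mallows_cp(full)                          # equals p for the full model
cp_small <- mallows_cp(lm(mpg ~ wt + hp, data = dat)) # a candidate submodel
```

Submodels whose Cp is close to (or below) their own p are the ones usually preferred.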

6 Chapter 6: LASSO & Ridge Regression

We first convert the two variables Presence of Drinking Water Violation and Stay At Home Order After First Case into categorical variables. We then normalize the data and split it into a training set and a test set so that we can estimate test error. The same split is used for LASSO here and for ridge regression later. For brevity, we selected 34 variables for the following analysis.
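The preparation steps described above can be sketched as follows. The data (`mtcars`), the binary column `am`, and the 70/30 split ratio are all assumptions for illustration; the report does not state the ratio it used.

```r
set.seed(123)
# Hypothetical stand-in for the county table: mtcars with one binary column
dat <- mtcars
dat$am <- factor(dat$am, labels = c("No", "Yes"))   # categorical variable

# Standardize every numeric column (mean 0, standard deviation 1)
num_cols <- sapply(dat, is.numeric)
dat[num_cols] <- lapply(dat[num_cols], function(x) as.numeric(scale(x)))

# Random 70/30 split into training and test sets (the ratio is an assumption)
train_idx <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```

Fixing the seed before sampling keeps the split reproducible, so the LASSO and ridge fits are evaluated on the same test rows.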

6.1 LASSO Regression

We draw the plot for different \(\lambda\) values to see the overall trend.


Lowest lambda from CV:  0.00246 

The cross-validated MSE is lowest at \(\lambda \approx 0.002\).

Mean MSE for best LASSO lambda:  0.203 

All the coefficients : 
       (Intercept)         population              young                old 
          -0.00301            0.26499            0.03709            0.02258 
             black               AIAN              Asian                 NH 
           0.00000           -0.00339           -0.09691           -0.00689 
          Hispanic                NHW             Female              Rural 
          -0.00461            0.00751           -0.01883            0.02263 
Population.Density 
           0.11744 

The non-zero coefficients : 
       (Intercept)         population              young                old 
          -0.00301            0.26499            0.03709            0.02258 
              AIAN              Asian                 NH           Hispanic 
          -0.00339           -0.09691           -0.00689           -0.00461 
               NHW             Female              Rural Population.Density 
           0.00751           -0.01883            0.02263            0.11744 

LASSO leaves 11 variables with non-zero coefficients and shrinks the coefficients of the remaining variables to zero. The retained variables suggest that race, gender, age, population, population density and rural proportion are all associated with total cases.
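To show why LASSO produces exact zeros, here is a small coordinate-descent sketch with soft-thresholding in base R. It is not the solver used in the report; the function names, the `mtcars` columns and the two penalty values are hypothetical, and the penalty scale need not match the \(\lambda\) values reported above.

```r
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimizes (1/(2n)) * sum((y - X b)^2) + lambda * sum(abs(b)).
# Assumes the columns of X are standardized and y is centered (no intercept).
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  b <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      r_j  <- y - X[, -j, drop = FALSE] %*% b[-j]   # partial residual
      z    <- sum(X[, j] * r_j) / n
      b[j] <- soft_threshold(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  setNames(b, colnames(X))
}

X <- scale(as.matrix(mtcars[, c("wt", "hp", "qsec", "drat", "disp")]))
y <- as.numeric(scale(mtcars$mpg))

b_small <- lasso_cd(X, y, lambda = 0.01)  # light penalty
b_large <- lasso_cd(X, y, lambda = 0.30)  # heavier penalty
sum(b_large == 0)   # number of coefficients shrunk exactly to zero
```

The soft-threshold step sets a coefficient to exactly zero whenever its correlation with the partial residual falls below the penalty, which is the mechanism behind the variable selection seen in the output above.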

We then calculate the R-squared of the LASSO model, which is 0.164.
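The R-squared here is presumably computed as 1 - SS_res / SS_tot on predicted versus observed values. A short sketch with made-up numbers (the helper `r_squared` and both vectors are hypothetical):

```r
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)       # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)    # total sum of squares
  1 - ss_res / ss_tot
}

actual    <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.1, 1.9, 3.2, 3.8)
r_squared(actual, predicted)   # 0.98
```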

6.2 Ridge Regression

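Dimensions of the ridge coefficient matrix (33 coefficients including the intercept, by 100 values of \(\lambda\)):

[1]  33 100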

       (Intercept)         population              young                old 
      -0.000006691        0.000051520       -0.000000684       -0.000005816 
             black               AIAN              Asian                 NH 
       0.000003757       -0.000001430        0.000020005        0.000001083 
          Hispanic                NHW             Female              Rural 
       0.000006572       -0.000009388        0.000004852       -0.000012388 
Population.Density    Housing.Density           Sunlight                GDP 
       0.000077406        0.000078049        0.000001705        0.000050606 
           Poverty         Unemployed   Children.Poverty  Income.Inequality 
      -0.000002654       -0.000001358       -0.000003231        0.000013610 
            Social              PM2.5           WaterYes                SHP 
      -0.000003449        0.000003749        0.000001931        0.000014048 
        poorhealth     Unhealthy.Days            smokers            Obesity 
      -0.000003469       -0.000004822       -0.000007770       -0.000011988 
    Physically.ina               WAEO                CRD               Temp 
      -0.000007008        0.000010306       -0.000010943       -0.000001884 
            Order1 
       0.000010345 
       (Intercept)         population              young                old 
        -0.0001074          0.0008357         -0.0000112         -0.0000937 
             black               AIAN              Asian                 NH 
         0.0000609         -0.0000231          0.0003231          0.0000169 
          Hispanic                NHW             Female              Rural 
         0.0001059         -0.0001515          0.0000785         -0.0001996 
Population.Density    Housing.Density           Sunlight                GDP 
         0.0012570          0.0012672          0.0000273          0.0008199 
           Poverty         Unemployed   Children.Poverty  Income.Inequality 
        -0.0000426         -0.0000217         -0.0000518          0.0002207 
            Social              PM2.5           WaterYes                SHP 
        -0.0000554          0.0000606          0.0000306          0.0002268 
        poorhealth     Unhealthy.Days            smokers            Obesity 
        -0.0000558         -0.0000776         -0.0001251         -0.0001935 
    Physically.ina               WAEO                CRD               Temp 
        -0.0001124          0.0001658         -0.0001765         -0.0000306 
            Order1 
         0.0001663 

Because ridge regression penalizes the “L2 norm” of the coefficients, the coefficients are expected to be smaller when \(\lambda\) is large. At our “mid-point” (the 50th value), \(\lambda\) equals 11498 and the coefficients are so small that their sum of squares prints as 0 at the displayed precision. At the 60th value (the sequence is decreasing), \(\lambda\) equals 705; the sum of squares still rounds to 0, but the coefficients printed above are about 16 times larger in magnitude (for example, population is 0.0008357 versus 0.00005152).
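The shrinkage effect can be illustrated with the closed-form ridge solution in base R. This is a sketch on `mtcars`, not the report's fit; the helper `ridge_coef` is hypothetical, and its penalty scale is not the same as the one used by the fitting routine in the report, so only the direction of the effect is comparable.

```r
# Closed-form ridge solution: b = (X'X + lambda * I)^(-1) X'y
# Assumes standardized X and centered y, so no intercept is needed.
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

X <- scale(as.matrix(mtcars[, c("wt", "hp", "qsec", "drat", "disp")]))
y <- as.numeric(scale(mtcars$mpg))

lambdas <- c(11498, 705, 50, 0)   # decreasing, as in the text
l2_norm <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
names(l2_norm) <- lambdas
l2_norm   # grows as lambda shrinks
```

Printing the norm with more digits (for example via `signif`) avoids the situation where very small values all display as 0.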

We can also use the predict function for other purposes, for example to obtain the coefficients at \(\lambda = 50\).

       (Intercept)         population              young                old 
        -0.0012334          0.0110737         -0.0001707         -0.0010885 
             black               AIAN              Asian                 NH 
         0.0007856         -0.0002850          0.0039843          0.0000888 
          Hispanic                NHW             Female              Rural 
         0.0012624         -0.0018608          0.0009939         -0.0023810 
Population.Density    Housing.Density           Sunlight                GDP 
         0.0170830          0.0172559          0.0002979          0.0108541 
           Poverty         Unemployed   Children.Poverty  Income.Inequality 
        -0.0004844         -0.0002253         -0.0005805          0.0029666 
            Social              PM2.5           WaterYes                SHP 
        -0.0006430          0.0007782          0.0002744          0.0028497 
        poorhealth     Unhealthy.Days            smokers            Obesity 
        -0.0006499         -0.0009101         -0.0014932         -0.0024228 
    Physically.ina               WAEO                CRD               Temp 
        -0.0012813          0.0019933         -0.0022012         -0.0004192 
            Order1 
         0.0019621 

We then use the training and test sets to estimate the test error.

With \(\lambda = 4\) on the standardized data, the test set mean squared error (MSE) is 0.174.

For comparison, the null model (\(\lambda\) approaching infinity) has an MSE of 0.244. So ridge regression with \(\lambda = 4\) lowers the test MSE from 0.244 to 0.174, a reduction of about 29%, trading a little bias for lower variance.

We could also have used a very large \(\lambda\) value to find the MSE for the null model. The two methods yield essentially the same answer of 0.244.
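The comparison between a ridge fit, the null model, and a very large \(\lambda\) can be sketched on simulated data. Everything below (the data, the helper names `mse` and `ridge_predict`, the 50/50 split) is hypothetical and only illustrates the logic; the numbers will not match the report's.

```r
set.seed(42)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)
y <- drop(X %*% c(1, 0.5, 0)) + rnorm(n)

train <- sample(n, n / 2)
Xtr <- X[train, ];  ytr <- y[train]
Xte <- X[-train, ]; yte <- y[-train]

mse <- function(actual, predicted) mean((actual - predicted)^2)

# Ridge fit on centered training data (closed form), intercept added back
ridge_predict <- function(lambda) {
  xm <- colMeans(Xtr); ym <- mean(ytr)
  Xc <- sweep(Xtr, 2, xm)
  b  <- solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, ytr - ym))
  ym + drop(sweep(Xte, 2, xm) %*% b)
}

mse_ridge <- mse(yte, ridge_predict(4))
mse_null  <- mse(yte, mean(ytr))            # intercept-only model
mse_huge  <- mse(yte, ridge_predict(1e10))  # very large lambda, close to the null model
```

With a very large penalty all slopes are shrunk to almost zero, so the prediction collapses to the training mean and `mse_huge` matches `mse_null`.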

       (Intercept)         population              young                old 
          -0.00192            0.37063            0.03949            0.01311 
             black               AIAN              Asian                 NH 
           0.10907            0.04541           -0.15188            0.00242 
          Hispanic                NHW             Female              Rural 
           0.07496            0.17384           -0.02449            0.03603 
Population.Density    Housing.Density           Sunlight                GDP 
           0.08226            0.73164           -0.00993           -0.07849 
           Poverty         Unemployed   Children.Poverty  Income.Inequality 
          -0.00886            0.00629           -0.05258            0.03361 
            Social              PM2.5           WaterYes                SHP 
           0.00193           -0.00942            0.02051            0.03758 
        poorhealth     Unhealthy.Days            smokers            Obesity 
           0.14088           -0.05720           -0.06817           -0.00413 
    Physically.ina               WAEO                CRD               Temp 
           0.01950           -0.01367            0.01171           -0.03587 
            Order1 
          -0.02147 

The other extreme is \(\lambda = 0\), which corresponds to the ordinary least squares (OLS) model. We first use the ridge regression fit to predict at \(\lambda = 0\). The MSE from this prediction is 0.201.

We can also fit the OLS model directly and calculate its MSE.


Call:
lm(formula = TC ~ ., data = train)

Residuals:
   Min     1Q Median     3Q    Max 
-5.814 -0.051  0.005  0.062  6.966 

Coefficients:
                   Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        -0.00191    0.01929   -0.10               0.9211    
population          0.37068    0.03979    9.32 < 0.0000000000000002 ***
young               0.03947    0.01958    2.02               0.0440 *  
old                 0.01310    0.02219    0.59               0.5552    
black               0.11054    0.18067    0.61               0.5408    
AIAN                0.04602    0.07642    0.60               0.5472    
Asian              -0.15159    0.03665   -4.14             0.000038 ***
NH                  0.00245    0.01160    0.21               0.8328    
Hispanic            0.07621    0.15610    0.49               0.6255    
NHW                 0.17577    0.23888    0.74               0.4620    
Female             -0.02448    0.01371   -1.79               0.0743 .  
Rural               0.03601    0.01809    1.99               0.0468 *  
Population.Density  0.08178    0.07716    1.06               0.2894    
Housing.Density     0.73210    0.07416    9.87 < 0.0000000000000002 ***
Sunlight           -0.00996    0.01842   -0.54               0.5889    
 [ reached getOption("max.print") -- omitted 18 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.372 on 1252 degrees of freedom
Multiple R-squared:  0.923, Adjusted R-squared:  0.921 
F-statistic:  469 on 32 and 1252 DF,  p-value: <0.0000000000000002

The MSE for the OLS regression is 0.135.
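A sketch of the direct OLS route, on simulated data with hypothetical column names (`x1`, `x2`) and a response named `TC` to mirror the formula above. It also checks that the exact least-squares solution, which is what ridge reduces to at \(\lambda = 0\), gives the same test MSE; an iterative or approximate solver evaluated at \(\lambda = 0\) need not agree as closely.

```r
set.seed(7)
n   <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$TC <- 0.8 * dat$x1 - 0.3 * dat$x2 + rnorm(n, sd = 0.5)

train_idx <- sample(n, 80)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# OLS on the training set, as in lm(TC ~ ., data = train)
ols     <- lm(TC ~ ., data = train)
mse_ols <- mean((test$TC - predict(ols, newdata = test))^2)

# Ridge with lambda = 0 reduces to the same normal equations
Xtr <- cbind(1, as.matrix(train[, c("x1", "x2")]))
Xte <- cbind(1, as.matrix(test[, c("x1", "x2")]))
b0  <- drop(solve(crossprod(Xtr), crossprod(Xtr, train$TC)))
mse_ridge0 <- mean((test$TC - drop(Xte %*% b0))^2)
```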

7 Chapter 7: Conclusion

8 Chapter 8: Bibliography

Cases in the U.S. (2020, August 01). Retrieved August 01, 2020, from https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html